1. Introduction
Super-Resolution (SR) is the process by which a Low-Resolution (LR) image is upscaled, with the aim of enhancing both the image's quality and level of detail. This operation exposes previously hidden information, which can subsequently be used to improve the performance of any task depending on the super-resolved image. SR is thus highly desirable in a vast number of important applications such as medical imaging [1,2], remote sensing [3,4,5], and the identification of criminals depicted in Closed-Circuit Television (CCTV) footage during forensic investigations [6,7].
Single Image SR (SISR) is typically formulated as the restoration of High-Resolution (HR) images that have been bicubically downsampled, or blurred and downsampled. On these types of LR images, state-of-the-art (SOTA) SR models can achieve extremely high performance, either by optimising for high pixel fidelity to the HR image [8,9,10,11,12,13], or by improving perceptual quality [14,15,16]. However, real-world images are often affected by additional factors such as sensor noise, complex blurring, and compression [17,18,19], which further deteriorate the image content and make the restoration process significantly more difficult. Moreover, many SR methods are trained on synthetically generated pairwise LR–HR images which only model a subset of the potential degradations encountered in real-world imaging systems [18,20]. As a result, the domain gap between synthetic and realistic data often causes such SR methods to perform poorly in the real world, hindering their practical use [6,18,21].
The field of blind SR is actively attempting to design techniques for image restoration which can deal with more realistic images containing unknown and complex degradations [18]. These methods often break down the problem by first estimating the degradations within an image, after which this prediction is used to improve the performance of an associated SR model. Prediction systems can range from the explicit, such as estimating the shape/size of a specific blur kernel, to the implicit, such as the abstract representation of a degradation within a Deep Neural Network (DNN) [18]. In the explicit domain, significant progress has been made in improving the accuracy and reliability of the degradation parameter estimation process. Recent mechanisms based on iterative improvement [22,23] and contrastive learning [24,25] have been capable of predicting the shape, size and noise of applied blur kernels with little to no error. However, such methods then go on to apply their prediction mechanisms with SR architectures that are smaller and less sophisticated than those used for SOTA non-blind SR.
In this work, we investigate how blind degradation prediction systems can be combined with any SR network containing a Convolutional Neural Network (CNN) component, regardless of the network architecture. A robust system for the integration of the blind and SR components would allow new techniques (both SR architectures and prediction mechanisms) to be immediately integrated and assessed under a blind setting. This would expedite and standardise the blind SR evaluation process, as well as allow any new mechanism to benefit from the latest SOTA architectures without requiring a complete redesign.
In our approach, we use a metadata insertion block to link the prediction and SR mechanisms, an operation which interfaces degradation vectors with SR network feature maps. We implement a variety of SR architectures and integrate these with the latest techniques for contrastive learning and iterative degradation prediction. Our results show that by using just a single Meta-Attention (MA) layer [26], high-performance SR models such as the Residual Channel Attention Network (RCAN) [8] and the Holistic Attention Network (HAN) [10] can be infused with degradation information to yield SR results which outperform those of the original blind SR networks trained under the same conditions.
We further extend our premise by performing blind degradation prediction and SR on images with blurring, noise and compression, constituting a significantly more complex degradation pipeline than that studied to date by other prediction networks [22,23,24,25,27]. We show that, even on such a difficult dataset, our framework is still capable of generating improved SR performance when combined with a suitable degradation prediction system.
The main contributions of this paper are thus as follows:
A framework for the integration of degradation prediction systems into SOTA non-blind SR networks.
A comprehensive evaluation of different methods for the insertion of blur kernel metadata into CNN SR networks. Specifically, our results show that simple metadata insertion blocks (such as MA) can match the performance of more complex metadata insertion systems when used in conjunction with large SR networks.
Blind SR results using a combination of non-blind SR networks and SOTA degradation prediction systems. These hybrid models show improved performance over both the original SR network and the original blind prediction system.
A thorough comparison of unsupervised, semi-supervised and supervised degradation prediction methods for both simple and complex degradation pipelines.
The successful application of combined (i) blind degradation prediction and (ii) SR on images degraded with a complex pipeline involving multiple types of noise, blurring and compression, which is more reflective of real-world applications than the pipelines considered by most SISR approaches.
The rest of this paper is organised as follows: Section 2 provides an overview of related work on general and blind SR, including the methods selected for our framework. Section 3 follows up with a detailed description of our proposed methodology for combining degradation prediction methods with SOTA SISR architectures. Our framework implementation details, evaluation protocol, degradation prediction and SR results are presented and discussed in Section 4. Finally, Section 5 provides concluding remarks and potential areas for further exploration.
2. Related Work
A vast number of methods have been proposed for SR, from the seminal Super-Resolution Convolutional Neural Network (SRCNN) [28] to more advanced networks such as RCAN [8], HAN [10], the Second-Order Attention Network (SAN) [9], Super-Resolution GAN (SRGAN) [14], and Enhanced SRGAN (ESRGAN) [15], among others. Domain-specific methods have also been implemented, such as those geared towards the super-resolution of face images (including the Super-Face Alignment Network (Super-FAN) [29] and the methods proposed in [7,30,31,32,33,34,35,36]) and satellite imagery [3,4,37], among others.
Most approaches derive an LR image from an HR image using a degradation model, which formulates how this process is performed and the relationship between the LR and HR images. Hence, an overview of common degradation models, including the one used as the basis for the proposed SR framework, is first provided and discussed. Given that the proposed framework combines techniques designed for both general-purpose SR and blind SR, an overview of popular and SOTA networks for both methodologies will then be provided.
2.1. Degradation Models
Numerous works in the literature have focused on the degradation models considered in the SR process, which define how HR images are degraded to yield LR images. However, the formation of an LR image $I_{LR}$ can be generally expressed by the application of a function $f$ on the HR image $I_{HR}$, as follows:

$$I_{LR} = f(I_{HR}; \theta) \qquad (1)$$

where $\theta$ is the set of degradation parameters which, in practice, are unknown.
The function $f$ can be expanded to consider the general set of degradations applied to $I_{HR}$, yielding the 'classical' degradation model as follows [6,18,19,20,22,23,38,39,40,41,42,43]:

$$I_{LR} = (I_{HR} \otimes k)\downarrow_s + n \qquad (2)$$
where $\otimes$ represents the convolution operation, $k$ is a kernel (typically a Gaussian blurring kernel, although it can also represent other functions such as the Point Spread Function (PSF)), $n$ represents additive noise, and $\downarrow_s$ is a downscaling operation (typically assumed to be bicubic downsampling [18]) with scale factor $s$. However, this model has been criticised for being too simplistic and unable to generalise well to the more complex degradations that are found in real-world images, thereby causing substantial performance losses when SR methods based on this degradation model are applied to non-synthetic images [17,19,39]. More complex and realistic degradation types have thus been considered, such as compression (which is typically signal-dependent and non-uniform, in contrast to the other degradations considered [19]), to yield a more general degradation model [17,18,44]:

$$I_{LR} = C\big((I_{HR} \otimes k)\downarrow_s + n\big) \qquad (3)$$

where $C$ is a compression scheme such as JPEG.
The aim of SR is then to approximate the inverse function of $f$, denoted by $f^{-1}$. This function can be applied on the LR image ($I_{LR}$) to reverse the degradation process and yield an image $I_{SR}$ approximating the original image ($I_{HR}$):

$$I_{SR} = f^{-1}(I_{LR}; \theta_r) \approx I_{HR} \qquad (4)$$

where $\theta_r$ represents the parameter set defining the reconstruction process. This degradation model forms the basis of the proposed SR framework.
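To make the general model concrete, the following is a minimal sketch of Equation (3) in PyTorch, assuming a recent version with antialiased bicubic interpolation; the kernel width, noise level, and JPEG quality shown are illustrative placeholders rather than values from our pipelines.

```python
import io

import numpy as np
import torch
import torch.nn.functional as F
from PIL import Image


def gaussian_kernel(size: int = 21, sigma: float = 1.5) -> torch.Tensor:
    """Build an isotropic Gaussian blur kernel k, as used in Equation (2)."""
    ax = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    xx, yy = torch.meshgrid(ax, ax, indexing="ij")
    kernel = torch.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    return kernel / kernel.sum()


def degrade(hr: torch.Tensor, sigma: float = 1.5, scale: int = 4,
            noise_std: float = 0.02, jpeg_quality: int = 75) -> torch.Tensor:
    """Apply Equation (3): I_LR = C(((I_HR conv k) downscaled by s) + n)."""
    channels = hr.shape[1]
    kernel = gaussian_kernel(sigma=sigma).expand(channels, 1, -1, -1).contiguous()
    # (I_HR ⊗ k): depthwise convolution applies the same kernel to each channel.
    blurred = F.conv2d(hr, kernel, padding=kernel.shape[-1] // 2, groups=channels)
    # ↓s: bicubic downsampling by the scale factor s.
    lr = F.interpolate(blurred, scale_factor=1 / scale, mode="bicubic", antialias=True)
    # + n: additive (here Gaussian) noise.
    lr = (lr + noise_std * torch.randn_like(lr)).clamp(0.0, 1.0)
    # C: JPEG compression via an in-memory encode/decode round trip.
    image = Image.fromarray((lr[0].permute(1, 2, 0).numpy() * 255).astype(np.uint8))
    buffer = io.BytesIO()
    image.save(buffer, format="JPEG", quality=jpeg_quality)
    buffer.seek(0)
    decoded = np.asarray(Image.open(buffer), dtype=np.float32) / 255.0
    return torch.from_numpy(decoded).permute(2, 0, 1).unsqueeze(0)


lr = degrade(torch.rand(1, 3, 96, 96))  # toy HR input -> 24 x 24 LR output
```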
Other works have extended the general model in Equation (3) to more complex cases. One such work synthesises training pairs using a 'high-order' degradation process, where the degradation model is applied more than once [16]. The authors of [19] proposed a 'practical' degradation model to train the ESRGAN-based BSRNet and BSRGAN models [19], which considers multiple Gaussian blur kernels, downscaling operators, noise levels modelled by Additive White Gaussian Noise (AWGN), processed camera sensor noise types, and quality factors of JPEG compression. Random shuffling of the order in which the degradations are applied is also performed.
Counter-arguments to these complex models have also been made. For instance, the authors of [17] argued that 'practical' degradation models as proposed in [16,19] (so called because a wide variety of degradations are considered, reflecting practical real-world applications) may achieve promising results on complex degradations but then ignore easier edge cases. Given that such cases primarily entail combinations of degradation subsets, a gated degradation model is proposed whereby the base degradations to be applied are randomly selected. Since the magnitudes of some degradations in the proposed framework may be reduced to the point where they are practically negligible, the degradation model forming the basis of this work (which is based on the processes defined in Equations (3) and (4), as previously discussed) can be said to approximate this gated mechanism. This also means that the proposed framework considers degradations found in practical real-world applications.
2.2. Non-Blind SR Methods
Most methods proposed for SR have tended to focus on the case where degradations are assumed to be known, either by designing models for specific degradations or by designing approaches that are able to use supplementary information about the degradations afflicting the image. However, this information is neither estimated nor derived from the corrupted image, limiting the use of such methods in the real world, where degradations are highly variable in terms of both their type and magnitude. Despite these limitations, non-blind SR methods have served an important role in enabling more rapid development of new techniques on what is arguably a simpler case of SR. An overview of notable methods will now be given.
The seminal non-blind SR method is considered to be the SRCNN network [28], which was one of the first to use deep learning and CNNs for the task of SR. However, it consists of only three layers and requires the LR image to first be upsampled using bicubic interpolation, and it is now outperformed by most modern approaches.
To facilitate the training of a large number of CNN layers, the Residual Network (ResNet) architecture proposed in [45] introduced skip connections to directly feed feature maps at any level of the network to deeper layers, a process corresponding to the identity function which deep networks find hard to learn. This counteracted the problem of vanishing gradients apparent in classical deep CNN networks, allowing the authors of [45] to expand their network size without impacting training performance. ResNet was extended to SR in [14], to create the Super-Resolution Residual Network (SRResNet) approach that was also used as the basis of a Generative Adversarial Network (GAN)-based approach termed SRGAN.
SRGAN was extended in [15] to yield ESRGAN, which included the introduction of a modified adversarial loss to determine the relative 'realness' of an image, and not just a decision as to whether the generated image is 'real' or 'fake'. ESRGAN also introduced a modified VGG-based perceptual loss, which uses feature maps extracted from the VGG residual blocks right before the activation layers to reduce sparsity and better supervise brightness consistency and texture recovery. ESRGAN was further extended in [16] to yield Real-ESRGAN, where the focus was on the implementation of a 'high-order' degradation process that allows the degradation model to be applied more than once (in contrast to most works, which only apply the model one time).
Enhanced Deep Super-Resolution (EDSR) [46] was also based on ResNet and incorporated observations noted in previous works such as SRResNet, along with other novel contributions that had a large impact on subsequent CNN-based SR models. These included the removal of batch normalisation layers to avoid restricting feature values and to reduce memory usage during training, which, in turn, allowed for a greater number of layers and filters to be used.
The RCAN approach proposed in [8] is composed of 'residual groups' that each contain a number of 'channel attention blocks', along with 'long' and 'short' skip connections to enable the training of very deep CNNs. The channel attention blocks enable the assignment of different levels of importance to low-frequency information across feature map channels. The concept of attention introduced by RCAN was developed further by other methods such as SAN [9] and HAN [10], which proposed techniques such as channel-wise feature re-scaling and the modelling of inter-dependencies among channels and layers.
Recently, vision transformers have also been applied to SR, such as the Encoder-Decoder-based Transformer (EDT) [47], the Efficient SR Transformer (ESRT) [48], and the Swin Image Restoration (SwinIR) [12] approach that is based on the Swin Transformer [49]. Approaches such as the Efficient Long-Range Attention Network (ELAN) [13] and the Hybrid Attention Transformer (HAT) [50], which attempt to combine CNN and transformer architectures, have also been proposed, with further improvements in SR performance.
An extensive exposition of generic non-blind SR methods can be found in [51,52].
2.3. Blind SR Methods
Although numerous SR methods have been proposed, a substantial number of approaches tend to employ the classical degradation model (Equation (2)). Besides this model not being reflective of real-world degradations (as discussed in Section 2.1), a substantial number of approaches also assume that the degradations afflicting an image are known. This is largely not the case, so such approaches tend to exhibit noticeable performance degradation on images found 'in-the-wild'. As a result, blind SR methods have been designed for better robustness when faced with such difficult and unknown degradations, making them more suitable for real-world applications. There exist several types of blind SR methods, based on the type of data used and how they are modelled [18]. An overview of the various types of approaches and representative methods will now be provided.
2.3.1. Approaches Utilising Supplementary Attributes for SR
Early work focused on the development of methods that are directly supplied with ground-truth information on degradations. The focus then turned to developing techniques to best utilise this degradation information (as opposed to non-blind SR methods, which do not use any form of supplementary information). Approaches of this kind generally consider the classical degradation model [18].
Notable methods incorporating metadata information in networks include the Super-Resolution network for Multiple Degradations (SRMD) [39], the Unified Dynamic Convolutional Network for Variational Degradations (UDVD) [53], the Deep Plug-and-Play SR (DPSR) framework, and the approach in [54]. Each of these approaches showed that SR networks could improve their performance by using this degradation information. Frameworks enabling the extension of existing non-blind SR methods to use degradation information have also been proposed, such as the 'meta-attention' approach in [26] and the Conditional hyper-network framework for SR with Multiple Degradations (CMDSR) proposed in [55].
These methods clearly show the plausibility of improving SR performance with degradation metadata. However, apart from requiring some means of obtaining the degradation information (which is not a trivial task), such methods are also highly reliant on the quality of the information input to the networks. Moreover, any deviations in the estimated inputs can lead to kernel mismatches that may be detrimental to SR performance [18,23,38].
2.3.2. Iterative Kernel Estimation Methods
Blur kernel estimation during the SR process is one of the most common blind SR prediction tasks, and it alleviates the problem of kernel mismatches present in methods such as SRMD, as described above. Often, iterative mechanisms are applied for direct kernel estimation. One such method is Iterative Kernel Correction (IKC) [22], which leverages the observation that kernel mismatch tends to produce regular patterns by estimating the degradation kernel and correcting it in an iterative fashion using a corrector network. In this way, an acceptable result is progressively approached. The authors of [22] also proposed a non-blind SR network, named Spatial Feature Transform Multiple Degradations (SFTMD), which was shown to outperform existing methods (such as SRMD) that insert blur kernel metadata into the SR process.
The Deep Alternating Network (DAN) [23] method (also known as DANv1) and its updated version DANv2 [38] build upon the IKC approach by combining the SR and kernel corrector networks within a single end-to-end trainable network. The corrector was also modified to use the LR input conditioned on intermediate super-resolved images, instead of conditioning these images on the estimated kernel as done in IKC. The Kernel-Oriented Adaptive Local Adjustment network (KOALAnet) [56] is able to adapt to spatially-variant characteristics within an image, which allows a distinction to be made between blur caused by undesirable effects and blur introduced intentionally for aesthetic purposes (e.g., the Bokeh effect). However, given that such methods still rely on kernel estimation, they also exhibit poor performance when evaluated on images having different degradations than those used to train the model.
2.3.3. Training SR Models on a Single Image
Another group of methods, such as KernelGAN [57] and Zero-Shot SR (ZSSR) [58], use intra- and inter-scale recurrence of patches, based on the internal statistics of natural images, to construct an individual model for each input LR image. Hence, the data used for training is that which is present internally within the image being super-resolved, circumventing the need to use an external dataset of images.
Such methods tend to assume that a downscaled version of a patch within an LR image should have a similar distribution to the patch in the original LR image. However, the assumption of recurring patches within and across scales may not hold true for all images (such as those containing a wide variety of content) [18].
2.3.4. Implicit Degradation Modelling
Modelling an explicit combination of multiple degradation types can be a very complex task on images found 'in-the-wild'. Hence, some approaches have also attempted to implicitly model the degradation process by comparing the data distribution of real-world LR image sets with synthetically generated 'clean' datasets (containing limited or no degradations) [18]. Methods are typically based on GANs, such as Cycle-in-Cycle GAN (CinCGAN) [59] and the approaches in [60,61], and do not require an HR reference for training.
However, one of the drawbacks of this type of method is that vast amounts of data tend to be required, which may not always be available. Some approaches, such as Degradation GAN [62] and Frequency Separation for real-world SR (FSSR) [63], attempt to counteract this issue by learning the HR-to-LR degradation process to generate realistic LR samples that can be used during SR model training. However, most models designed for implicit degradation modelling use GANs, which are known to be hard to train and may introduce fake textures or artefacts that can be detrimental for some real-world applications [18].
2.3.5. Contrastive Learning
In the image classification domain, DNNs are known to be highly capable of learning invariant representations, enabling the construction of good classifiers [64]. However, it has been argued that DNNs are actually too eager to learn invariances [64]. This is because they often learn only the features necessary to discriminate between classes, but then fail to generalise well to new unseen classes when employing a supervised setting, in what is known as supervision collapse [64,65]. Indeed, the ubiquitous cross-entropy loss used to train supervised deep classifier models has received criticism for several shortcomings [66], such as its sensitivity to noisy labels [67,68] and the possibility of poor margins [69,70,71].
Contrastive learning techniques, mostly developed in the Natural Language Processing (NLP) domain, have recently seen a resurgence and have driven significant advances in self-supervised representation learning in an attempt to mitigate these issues [66]. Contrastive learning is a self-supervised approach where models are trained by comparing and contrasting 'positive' image pairs with 'negative' pairs [72]. Positive images can be easily created by applying augmentations to a source image (e.g., flipping, rotations, colour jitter, etc.). In the SR domain, positive samples are typically patches extracted from within the same image, while crops taken from other images are labelled as negative examples [24,72,73].
One such contrastive learning-based approach is Momentum Contrast (MoCo), proposed in [73] for the tasks of object detection, classification, and segmentation. MoCo employs a large queue of data samples to enable the use of a dictionary (containing samples observed in preceding mini-batches) which is much larger than the mini-batch size. However, since a large dictionary also makes it intractable to update the network parameters using back-propagation, a momentum update which tightly controls the parameters' rate of change is also proposed. MoCo was extended in [74] to yield MoCoV2, based on design improvements proposed for SimCLR [72]. The two main modifications constitute the replacement of the fully-connected layer at the head of the network with a Multi-Layer Perceptron (MLP) head, and the inclusion of blur augmentation.
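As an illustration of the two key MoCo ingredients described above, the following is a minimal sketch of the momentum update of the key encoder and of an InfoNCE-style contrastive loss computed against a queue of negatives. The embedding size (256), queue length (8192) and temperature (0.07) mirror values used later in this work; the encoders themselves are stand-ins.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def momentum_update(query_enc, key_enc, m: float = 0.999):
    """Key encoder parameters slowly trail the query encoder (momentum update)."""
    for q_param, k_param in zip(query_enc.parameters(), key_enc.parameters()):
        k_param.data = m * k_param.data + (1.0 - m) * q_param.data


def info_nce(q, k_pos, queue, temperature: float = 0.07):
    """Contrastive loss: pull q towards its positive key, push away from the queue."""
    q = F.normalize(q, dim=1)                     # (B, D) query embeddings
    k_pos = F.normalize(k_pos, dim=1)             # (B, D) positive key embeddings
    l_pos = (q * k_pos).sum(dim=1, keepdim=True)  # (B, 1) positive logits
    l_neg = q @ queue.t()                         # (B, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long)  # positives sit at index 0
    return F.cross_entropy(logits, labels)


# Toy usage: 256-D embeddings and a queue of 8192 previously-seen keys.
q, k = torch.randn(8, 256), torch.randn(8, 256)
queue = F.normalize(torch.randn(8192, 256), dim=1)
loss = info_nce(q, k, queue)
```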
Supervised MoCo (SupMoCo) [64] was also proposed as an extension of MoCo, whereby class labels are additionally utilised to enable intra-class variations to be learnt whilst retaining knowledge of distinctive features acquired by the self-supervised components. SupMoCo was shown to outperform the Supervised Contrastive (SupCon) approach [66], which also applied supervision to SimCLR [72].
Such self-supervised methods have seen limited use in the SR domain thus far. However, the promising performance demonstrated in other domains could encourage further research and development in the SR arena. One of the first SR networks to use contrastive learning for blind SR was the Degradation-Aware SR (DASR) network [24], where an unsupervised content-invariant degradation representation is obtained in a latent feature space in which the mutual information among all samples is maximised.
In [25], Implicit Degradation Modelling Blind SR (IDMBSR) considers the degrees of difference in degradation between a query image and negative exemplars, in order to determine the amount of 'push' to exert. Specifically, the greater the difference between a query and a negative example, the greater the push. In this way, degradation information is used as weak supervision of the representation learning to improve the network's ability to characterise image degradations.
2.3.6. Other Methodologies
The mechanisms discussed in this section are some of the most prominent in the blind SR literature. However, many other modalities exist which do not neatly fall into any single category. One such method is Mixture of Experts SR (MoESR) [42], where a panel of expert predictors is used to help optimise the prediction system for blur kernels of different shapes and sizes. A comprehensive overview of the state of blind SR research can be found in [18].
2.4. Conclusions
As described above, there exist numerous algorithms, architectures, and frameworks designed to perform the task of SR, both for non-blind and blind scenarios. However, current approaches for both tasks suffer from major drawbacks. On the one hand, non-blind SR architectures have been optimised for high performance on synthetic data, but under-perform in practical degradation scenarios. On the other hand, blind SR networks and degradation prediction systems are more capable of dealing with real-world images but use weaker and more limited architectures that restrict their performance ceiling. These observations serve as the motivation to combine the best of both worlds, chiefly to leverage the performance of non-blind SR networks with the practical applications of degradation prediction systems.
Given the prominence of iterative and contrastive methods in blind SR, a number of these mechanisms were selected and implemented within the proposed framework. The principles of SupMoCo were also applied to construct a more controllable self-supervised contrastive learning function (Section 3.5.2). However, in principle, any degradation estimation system could be coupled to any SR model, using the framework described in the rest of this paper.
4. Experiments & Results
4.1. Implementation Details
4.1.1. Datasets and Degradations
For analysis purposes, we created two LR degradation pipelines:
Simple pipeline (blurring and downsampling): For our metadata insertion screening and blind SR comparison, we used a degradation set of just Gaussian blurring and bicubic downsampling, corresponding to the 'classical' degradation model described in Equation (2). Apart from minimising confounding factors, this allowed us to make direct comparisons with pre-trained models provided by the authors of other blind SR networks. For all scenarios, we used only 21 × 21 isotropic Gaussian kernels with a random width (σ) in the range [0.2, 3.0] (as recommended in [16]), and ×4 bicubic downsampling. The σ parameter was then normalised to the closed interval [0, 1] before it was passed to the models.
Complex pipeline: In our extended blind SR training schemes, we used a full degradation pipeline as specified in Equation (3), i.e., sequential blurring, downsampling, noise addition and compression. For each operation in the pipeline, a configuration was randomly selected (from a uniform distribution) from the following list (a sketch of this sampling logic is given after the list):
- Blurring: As proposed in [16], we sampled blurring from a total of 7 different kernel shapes: iso/anisotropic Gaussian, iso/anisotropic generalised Gaussian, iso/anisotropic plateau, and sinc. Kernel σ values (both vertical and horizontal) were sampled from the range [0.2, 3.0], kernel rotation ranged from −π to π (all possible rotations), and the shape parameter β for both generalised Gaussian and plateau kernels, as well as the sinc kernel cutoff frequency, were sampled from the ranges recommended in [16]. All kernels were set to a size of 21 × 21 and, in each instance, the blur kernel shape was randomly selected, with equal probability, from the 7 available options. For a full exposition on the selection of each type of kernel, please refer to [16].
- Downsampling: As in the initial model screening, we again retained ×4 bicubic downsampling for all LR images.
- Noise addition: Again following [16], we injected noise using one of two different mechanisms, namely Gaussian (signal-independent read noise) and Poisson (signal-dependent shot noise). Additionally, the noise was either independently added to each colour channel (colour noise), or applied to each channel in an identical manner (grey noise). The Gaussian and Poisson mechanisms were randomly applied with equal probability, grey noise was selected with a probability of 0.4, and the Gaussian sigma and Poisson scale values were randomly sampled from the ranges recommended in [16].
- Compression: We increased the complexity of compression used in previous works by randomly selecting either JPEG or JM H.264 (version 19) [76] compression at runtime. For JPEG, a quality value was randomly selected from the range recommended in [16]. For JM H.264, images were compressed as single-frame YUV files, where a random I-slice Quantization Parameter (QPI) was selected from the range discussed in [26].
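The following sketch illustrates how one such configuration can be drawn per LR image; the option lists and probabilities mirror the description above, while the exact parameter ranges are deferred to [16,26] as noted.

```python
import random

BLUR_SHAPES = ["iso_gaussian", "aniso_gaussian", "iso_gen_gaussian",
               "aniso_gen_gaussian", "iso_plateau", "aniso_plateau", "sinc"]


def sample_degradation_config() -> dict:
    """Randomly draw one complex-pipeline configuration for a single LR image."""
    return {
        # Blurring: one of 7 kernel shapes, selected with equal probability.
        "blur_shape": random.choice(BLUR_SHAPES),
        "kernel_size": 21,
        # Downsampling is fixed to x4 bicubic for all LR images.
        "scale": 4,
        # Noise: Gaussian or Poisson with equal probability; grey
        # (channel-identical) noise is chosen with probability 0.4.
        "noise_type": random.choice(["gaussian", "poisson"]),
        "grey_noise": random.random() < 0.4,
        # Compression: JPEG or JM H.264, selected at runtime.
        "compression": random.choice(["jpeg", "jm_h264"]),
    }


print(sample_degradation_config())
```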
To allow for a fair (and direct) comparison to other works, the training, validation and testing HR datasets we selected are identical to those used in the SR works we used as baselines or comparison points (Section 4.1.2). Thus, all our models were trained on LR images generated from the HR images of DIV2K [77] (800 images) and Flickr2K [78] (2650 images). Validation and best model selection were performed on the provided DIV2K validation set (100 images).
For the final results comparison, the standard SR test sets Set5 [79], Set14 [80], BSDS100 [81], Manga109 [82] and Urban100 [83] were utilised. For these test images, the parameters of each degradation were explicitly selected. The exact degradation details for each scenario are specified in all the tables and figures presented. The super-resolved images were compared with the corresponding target HR images using several metrics during testing and validation, namely the Peak Signal-to-Noise Ratio (PSNR) and Structural SIMilarity index (SSIM) [84] (direct pixel comparison metrics), and the Learned Perceptual Image Patch Similarity (LPIPS) [85] (a perceptual quality metric). In all cases, images were first converted to YCbCr, and the Y channel was used to compute the metrics.
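As an illustration of the evaluation procedure, the sketch below computes PSNR on the Y channel; the BT.601 luma weights used here are a common approximation of the full YCbCr conversion (published SR results typically use the MATLAB-style conversion with its offset and scaling).

```python
import numpy as np


def rgb_to_y(img: np.ndarray) -> np.ndarray:
    """Approximate the Y (luma) channel of YCbCr from an RGB image in [0, 1]."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 0.299 * r + 0.587 * g + 0.114 * b


def psnr_y(sr: np.ndarray, hr: np.ndarray) -> float:
    """PSNR (dB) between the Y channels of the SR output and the HR target."""
    mse = np.mean((rgb_to_y(sr) - rgb_to_y(hr)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(1.0 / mse)


sr, hr = np.random.rand(64, 64, 3), np.random.rand(64, 64, 3)
print(f"PSNR(Y): {psnr_y(sr, hr):.2f} dB")
```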
4.1.2. Model Implementation, Training and Validation
Due to the diversity of the models investigated in this work, a number of different training and validation schemes were followed depending on the task and network being investigated:
Non-blind SR model training: For non-blind model training, we initialised the networks with the hyperparameters as recommended by their authors, unless otherwise specified. All models were trained from scratch on LR–HR pairs generated from the DIV2K and Flickr2K datasets using either the simple or complex pipelines. For the simple pipeline, one LR image was generated from each HR image. For the complex pipeline, five LR images were generated per HR image to improve the diversity of degradations available. In both cases, the LR image set was generated once and used to train all models. All simple pipeline networks were trained for 1000 epochs, whereas the complex pipeline networks were trained for 200 epochs to ensure fair comparisons (since each epoch contains 5 times as many samples as in the simple case). This training duration (in epochs) was chosen as a compromise between obtaining meaningful results and keeping the total training time low.
For both pipelines, training was carried out on 64 × 64 LR patches, and the Adam [87] optimiser was used. Variations in batch size and learning rate scheduling were made for specific models as necessary, in order to ensure training stability and limit Graphical Processing Unit (GPU) memory requirements. The configurations for the non-blind SR models tested are as follows:
- RCAN [8] and HAN [10]: We selected RCAN to act as our baseline model as a compromise between SR performance and architectural simplicity. To push performance boundaries further, we also trained and tested HAN as a representative SOTA pixel-quality CNN-based SR network. For these models, the batch size was set to 8 in most cases, and a cosine annealing scheduler [88] was used with a warm restart after every 125,000 iterations and an initial learning rate of 1 × 10⁻⁴ (a sketch of this optimiser/scheduler configuration is given after this list). Training was driven solely by the L1 loss function, which compares the SR image with the target HR image. After training, the epoch checkpoint with the highest validation PSNR was selected for final testing.
- Real-ESRGAN [16]: We selected Real-ESRGAN as a representative SOTA perceptual quality SR model. The same scheme described for the original implementation was used to train this network. This involved two phases: (i) a pre-training stage where the generator was trained with just an L1 loss, and (ii) a multi-loss stage where a discriminator and a VGG perceptual loss network were introduced (further details are provided in [16]). We pre-trained the model for 715 and 150 epochs (which match the pretrain:GAN ratio as originally proposed in [16]) for the simple and complex pipelines, respectively. In both cases, fixed learning rates matching the original implementation [16] were used for the pre-training and multi-loss stages. A batch size of 8 was used in all cases. After training, the model checkpoint with the lowest validation LPIPS score in the last 10% of epochs was selected for testing.
- ELAN [13]: We also conducted a number of experiments with ELAN, a SOTA transformer-based model. For this network, a batch size of 8 and a fixed learning rate were used in all cases. As with RCAN and HAN, the L1 loss was used to drive training, and the epoch checkpoint with the highest validation PSNR was selected for final testing.
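The warm-restart schedule described for RCAN and HAN above maps directly onto PyTorch's built-in scheduler; the convolution below is a stand-in for the actual SR network and the loop is a toy excerpt.

```python
import torch

model = torch.nn.Conv2d(3, 3, 3, padding=1)  # stand-in for the SR network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Cosine annealing with a warm restart after every 125,000 iterations.
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=125_000)

for step in range(3):  # toy training-loop excerpt
    optimizer.zero_grad()
    # Placeholder loss standing in for the L1 loss against the HR target.
    loss = model(torch.rand(8, 3, 64, 64)).abs().mean()
    loss.backward()
    optimizer.step()
    scheduler.step()  # stepped once per iteration
```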
Iterative Blind SR: Since the DAN iterative scheme requires the SR image to improve its degradation estimate, the predictor model needs to be trained simultaneously with the SR model. We used the same CNN-based predictor network described in DANv1 [23] for our models, and fixed the iteration count to four in all cases (matching the implementation described in [23]). We coupled this predictor with our non-blind SR models using the framework described in Section 3.2 (a sketch of the alternating loop is given below). We trained all DAN models by optimising for the SR L1 loss (identical to the non-blind models) and an additional L1 loss component comparing the prediction and ground-truth vectors. Target vectors varied according to the pipeline; the details of each are provided in their respective results sections. For each specific model architecture, the hyperparameters and validation selection criteria were all set to be identical to those of the base, non-blind model. The batch size for all models was adjusted to 4 due to the increased GPU memory requirements of the iterative training scheme. Accordingly, whenever a warm restart scheduler was used, the restart point was adjusted to 250,000 iterations (to maintain the same total number of iterations as performed by the other models, which utilised a batch size of 8 for 125,000 iterations).
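A minimal sketch of the alternating estimate-and-restore loop is given below, following the DAN formulation; `predictor`, `sr_net` and `upsample` are hypothetical stand-ins for the trained corrector, the metadata-fed SR network, and an initialisation upsampler, respectively.

```python
import torch
import torch.nn.functional as F


def dan_forward(lr, predictor, sr_net, upsample, n_iters: int = 4):
    """Alternate degradation prediction and SR for a fixed number of iterations."""
    sr = upsample(lr)  # initial SR estimate (e.g., bicubic)
    kernel_est = None
    for _ in range(n_iters):
        # Corrector: condition the estimate on the LR input and the current SR.
        kernel_est = predictor(lr, sr)
        # Restorer: the SR network consumes the LR image plus the estimate
        # through its metadata insertion (MA) block.
        sr = sr_net(lr, kernel_est)
    return sr, kernel_est


def dan_loss(sr, hr, kernel_est, kernel_gt):
    """L1 on the SR image plus L1 on the predicted degradation vector."""
    return (sr - hr).abs().mean() + (kernel_est - kernel_gt).abs().mean()


# Toy usage with dummy callables (real models would be trained networks).
upsample = lambda x: F.interpolate(x, scale_factor=4, mode="bicubic")
predictor = lambda lr, sr: torch.zeros(lr.shape[0], 10)  # dummy PCA estimate
sr_net = lambda lr, k: upsample(lr)                      # dummy SR network
sr, k_est = dan_forward(torch.rand(1, 3, 24, 24), predictor, sr_net, upsample)
```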
Additionally, we also trained the original DANv1 model from scratch, using the same hyperparameters from [23] and the same validation scheme as the other DAN models. The batch size was also fixed to 4 in all cases.
Contrastive Learning: We used the same encoder from [24] for most of our contrastive learning schemes. This encoder consists of a convolutional core connected to a set of three fully-connected layers. During training, we used the output of the fully-connected layers (Q) to calculate the contrastive loss values (Section 3.5) and update the encoder weights, following [24]. Before coupling the encoder with an SR network, we first pre-trained the encoder directly. For this pre-training, the batch size was set to 32 and data was generated online, i.e., each LR image was synthesised on the fly at runtime. All encoders were trained with a constant learning rate, a patch size of 64 × 64 and the Adam optimiser. The encoders were trained until the loss started to plateau and t-Distributed Stochastic Neighbour Embedding (t-SNE) clustering of degradations generated on a validation set composed of 400 images from CelebA [89] and BSDS200 [81] was clearly visible (more details on this process are provided in Section 4.3.1). In all cases, the temperature hyperparameter, momentum value, queue length, and encoder output vector size were set to 0.07, 0.999, 8192 and 256, respectively (matching the models from [24]).
After pre-training, each encoder was coupled to non-blind SR networks using the framework discussed in Section 3.2. For standard encoders, the encoding (i.e., the output from the convolutional core that bypasses the fully-connected layers) is typically fed into the metadata insertion blocks directly, unless specified. For encoders with a regression component (see Figure 2B), the dropdown output is fed to the metadata insertion block instead of the encoding. The combined encoder + SR network was then trained using the same dataset and hyperparameters as the non-blind case. The encoder weights were frozen and no gradients were generated for the encoding at runtime, unless specified.
In our analysis, we use the simple pipeline as our primary blind SR task and the complex pipeline as an extension scenario for the best performing methods. In Section 4.2, we discuss our metadata insertion block testing, while Section 4.3 and Section 4.4 present our degradation prediction and SR analysis on the simple pipeline, respectively. Section 4.5 and Section 4.6 follow up with our analysis on the complex pipeline, and Section 4.7 presents some of our blind SR results on real-world degraded images.
4.2. Metadata Insertion Block Testing
To test and compare the various metadata insertion blocks selected, we implemented each block into RCAN, and trained a separate model from scratch on our simple pipeline dataset. Each metadata insertion block was given either the real blur kernel width (normalised to the range [0, 1]) or the PCA-reduced kernel representation for each LR image. The PSNR test results for each model have been compiled in Table 1, and plotted in a comparative bar graph in Figure 5. The SSIM results are also available in the supplementary information (Table S1).
From the results, it is evident that metadata insertion provides a significant boost to performance across the board. Somewhat surprisingly, the results also show that no single metadata insertion block has a clear advantage over the rest. Every configuration tested, including those where multiple metadata insertion blocks are provided, produces roughly the same level of performance with only minor variations across dataset/degradation combinations. This outcome suggests that each metadata block extracts the same amount of useful information from the input kernel. Further complexity, such as the DA block's kernel transformation or the SFT/DGFMB feature map concatenation, provides no further gain in performance. Even adding further detail to the metadata, such as by converting the full blur kernel into a PCA-reduced vector, provides no performance gains. This again seems to suggest that the network is capable of extrapolating the kernel width to the full kernel description, without requiring any additional data engineering. Furthermore, adding just a single block at the beginning of the network appears to be enough to inform the whole network, with additional layers providing no improvement (while a decrease in performance is actually observed in the case of DA). We hypothesise that this might be due to the fact that degradations are mostly resolved in the earlier low-frequency stages of the network.
Given that all metadata insertion blocks provide almost identical performance, we selected a single MA block for our blind SR testing, owing to its low overhead and simplicity relative to the other approaches. While it is clear that the more complex metadata insertion blocks do not provide increased performance on this dataset, it is still possible that they might provide further benefit if other types of metadata are available.
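To ground the discussion, the following is a minimal sketch of a meta-attention-style block in the spirit of [26]: the degradation vector is passed through a small fully-connected network whose sigmoid output re-scales the feature map channels. The layer sizes are illustrative rather than taken from the original implementation.

```python
import torch
import torch.nn as nn


class MetaAttention(nn.Module):
    """Channel-wise modulation of SR feature maps by a degradation vector."""

    def __init__(self, n_channels: int, meta_dim: int, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(meta_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, n_channels),
            nn.Sigmoid(),  # per-channel attention weights in (0, 1)
        )

    def forward(self, feat: torch.Tensor, meta: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) feature maps; meta: (B, D) degradation vector.
        weights = self.mlp(meta).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        return feat * weights


# Toy usage: modulate 64-channel features with a 1-D blur-width vector.
block = MetaAttention(n_channels=64, meta_dim=1)
out = block(torch.rand(2, 64, 32, 32), torch.rand(2, 1))
```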
4.3. Blur Kernel Degradation Prediction
To test our degradation prediction mechanisms, we evaluated the performance of these methods on a range of conditions and datasets.
4.3.1. Contrastive Learning
For contrastive learning methods, the prediction vectors generated are not directly interpretable. This makes it difficult to quantify the accuracy of the prediction without some form of clustering/regression analysis. However, through the use of dimensionality reduction techniques such as t-SNE [90], the vectors can easily be reduced to 2-D, which provides an opportunity for qualitative screening of each model in the form of a graph.
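A sketch of this screening procedure using scikit-learn's t-SNE implementation is shown below; the embeddings and labels are random stand-ins for the encoder outputs and the blur widths used to degrade each test image.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

# Stand-ins: 256-D prediction vectors for 927 test images, coloured by the
# blur width used to degrade each image.
embeddings = np.random.randn(927, 256)
sigmas = np.random.choice([1.0, 2.0, 3.0], size=927)

# Reduce the encoder outputs to 2-D for qualitative cluster inspection.
coords = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1], c=sigmas, cmap="viridis", s=5)
plt.colorbar(label="blur kernel width")
plt.savefig("tsne_degradation_clusters.png")
```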
We trained separate encoders using our three contrastive algorithms on the simple degradation pipeline. The testing epoch and training details for each model are provided in Table 2. For SupMoCo schemes, the models were trained with triple precision labelling (low/medium/high labels) with respect to the blur σ value. For WeakCon, the weighting was found by calculating the Euclidean distance between the normalised kernel widths (σ) of the query and negative samples (as proposed in [25]). The number of training epochs for each encoder was selected qualitatively, based on when the contrastive loss started to plateau, and after degradation clustering could be qualitatively observed on the validation set. Training the encoders beyond this point seemed to provide little to no performance gain, as shown in Table 3.
We used the trained encoders to generate prediction vectors for the entirety of the BSDS100, Manga109 and Urban100 testing datasets, with each image degraded with three different kernel widths (927 images in total), and then applied t-SNE reduction to each set of outputs. We also fed this test set to a pretrained DASR encoder (using the weights provided in [24]) for direct comparison with our own networks. The t-SNE results are presented in Figure 6.
It is immediately apparent that all of the models achieve some level of separation between the three different σ values. However, the semi-supervised methods produce very clear clustering (with just a few outliers), while the MoCo methods generate clusters with less well-defined edges. The influence of the labelling systems clearly produces a very large repulsion effect between the different σ widths, which the unsupervised MoCo system cannot match. Interestingly, there is no discernible distinction between the WeakCon and SupMoCo plots, despite their different modes of action. Additionally, minor modifications to the training process, such as swapping the encoder for a larger model (e.g., ResNet) or continuing to train the predictor in tandem with an SR model (SR results in Section 4.4), appear to provide no benefit, or even degrade the output clusters.
4.3.2. Regression Analysis
For our iterative and regression models, the output prediction is much simpler to interpret. Direct σ and PCA kernel estimates can be immediately compared with the actual value. We trained a variety of iterative DAN models, using RCAN as our base SR model for consistency. Several separate RCAN-DAN models were implemented: one specifically predicting the σ value and others predicting a 10-element PCA representation of each kernel. We also trained two DANv1 models (predicting PCA kernels) from scratch for comparison: one using a fixed learning rate (matching the original implementation in [23]) and one using our own cosine annealing scheduler with a restart value of 250,000 (matching our other models). We compared the prediction capabilities of our models, and a number of pretrained checkpoints from the literature, on our testing sets (blurred with multiple values of σ). The pretrained models for IKC, DANv1 and DANv2 were extracted from their respective official code repositories. These pretrained models were also trained on DIV2K/Flickr2K images, but their training degradations were generated online (with σ in the range [0.2, 4.0]), which should result in superior performance.
Figure 7A shows the prediction error of the direct regression models that were trained (both contrastive and iterative models). The results clearly show that the DAN predictor is the strongest of those tested, with errors below 0.05 in some cases (representing an error of less than 2.5%). The contrastive/regression methods, while producing respectable results in select scenarios, seem to suffer across most of the distribution tested. In both types of models, the error seems to increase when the width is at its lower range. We hypothesise that, at this point, it is difficult to distinguish between σ values of 0.2–0.4, given that the corresponding kernels are quite small.
Figure 7B shows the results of the PCA prediction models. The plot shows that our RCAN-DAN models achieve very similar prediction performance to the pretrained DANs. What makes this result remarkable is the fact that our models were trained for much less time than the pretrained DAN models, both of which were trained for ≈7000 epochs. Training DANv1 from scratch for the same amount of time as our models (1000 epochs) shows that the prediction performance at this point is markedly worse. It is clear that the larger and more capable RCAN model is helping to boost the prediction performance significantly. On the other hand, the pretrained IKC model is significantly outclassed by all DAN models in almost all scenarios. It is also worth noting that the prediction of kernels at the lower end of the spectrum suffers from increased error across the board.
4.4. Blind SR on Simple Pipeline
The real test for our combined SR and predictor models is their blind SR performance. Table 3 presents the blind SR PSNR results of all the models considered on the test sets under various levels of blur σ. SSIM results are also provided in the supplementary information (Table S2). Figure 8 and Figure 9 further complement these results, with a bar chart comparison of key models and a closer look at the SR performance across various levels of σ, respectively.
With reference to the model categories highlighted in Table 3, we make the following observations:
Additionally, we implemented, trained and tested the Real-ESRGAN and ELAN models with the addition of the MA metadata insertion block (with the same hyperparameters as presented in Section 4.1). The testing results are available in the supplementary information (Table S3 containing the Real-ESRGAN LPIPS results, and Tables S4 and S5 containing the PSNR and SSIM results for ELAN, respectively). For Real-ESRGAN, the addition of the true metadata (non-blind) makes a clear improvement over the base model. We also observed a consistent improvement in performance across datasets and σ values for the DAN-upgraded model. However, attaching the best performing SupMoCo encoder provided no clear advantage. We hypothesise that the Real-ESRGAN model is more sensitive to the accuracy of the kernel prediction, and thus sees limited benefit from the less accurate contrastive encoder (as we have shown for the DAN vs. contrastive methods (Figure 7)).
For ELAN, the baseline model is very weak, and is actually surpassed by Lanczos upsampling in one case (both in terms of PSNR and SSIM). The addition of the true metadata only appeared to help when MA was distributed through the whole network, upon which it increased the performance of the network massively (>3 dB in some cases). It is clear that ELAN does not perform well on these blurred datasets (ELAN was originally tested only on bicubically downsampled datasets). However, MA still appears to be able to significantly improve the model’s performance under the right conditions. Further investigation is required to first adapt ELAN for such degraded datasets before attempting to use this model as part of our blind framework.
4.5. Complex Degradation Prediction
For our extended analysis on more realistic degradations, we trained three contrastive encoders (MoCo, SupMoCo and WeakCon) and one RCAN-DAN model on the complex pipeline dataset (Section 4.1). Given the large quantity of degradations, we devised a number of testing scenarios, each applied on the combined images of BSDS100, Manga109 and Urban100 (309 images in total). The scenarios we selected are detailed in Table 4. We will refer to these testing sets for the rest of this analysis. We evaluated the prediction capabilities of the contrastive and iterative models separately. We purposefully limited the testing blur kernel shapes to isotropic/anisotropic Gaussians to simplify the analysis.
4.5.1. Contrastive Learning
For each of the contrastive algorithms, we trained an encoder (each with the same architecture as used for the simple pipeline) using the following protocol:
We first pre-trained the encoder with an online pipeline of noise (same parameters as the full complex pipeline, but with an equal probability of selecting grey or colour noise) and bicubic downsampling. We found that this pre-training helps reduce loss stagnation for the SupMoCo encoder, so we applied it to all encoders. The SupMoCo encoder was trained with double precision labelling at this stage. We used 3 positive patches for SupMoCo and 1 positive patch for both MoCo and WeakCon.
After 1099 epochs, we started training the encoder on the full online complex pipeline (Section 4.1). The SupMoCo encoder was switched to triple precision from this point onwards.
We stopped all encoders after 2001 total epochs, and evaluated them at this checkpoint.
For SupMoCo, the decision tree in Section 3.5.2 was used to assign class labels. For WeakCon, the weighting was computed as the Euclidean distance between query/negative sample vectors containing: the vertical and horizontal blur σ, the Gaussian/Poisson sigma/scale, respectively, and the JPEG/JM H.264 quality factor/QPI, respectively (6 elements in total). All values were normalised to [0, 1] prior to computation, as sketched below.
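The weighting computation described above reduces to a Euclidean distance between two normalised 6-element degradation vectors, e.g.:

```python
import numpy as np


def normalise(value: float, lo: float, hi: float) -> float:
    """Map a degradation parameter onto [0, 1] given its sampling range."""
    return (value - lo) / (hi - lo)


def weakcon_weight(query: np.ndarray, negative: np.ndarray) -> float:
    """Repulsion weight: Euclidean distance between two normalised 6-element
    vectors (vertical/horizontal blur sigma, Gaussian sigma, Poisson scale,
    JPEG quality factor, JM H.264 QPI)."""
    return float(np.linalg.norm(query - negative))


# Toy example with already-normalised vectors: a heavily-degraded query
# versus a lightly-degraded negative yields a large push.
query = np.array([0.9, 0.8, 0.7, 0.0, 0.9, 0.0])
negative = np.array([0.1, 0.1, 0.1, 0.0, 0.1, 0.0])
print(weakcon_weight(query, negative))
```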
As with the simple pipeline, contrastive encodings are not directly interpretable, and so we analysed the clustering capabilities of each encoder through t-SNE visualisations. We evaluated each encoder on the full testing scenario (Iso/Aniso + Gaussian/Poisson + JPEG/JM in Table 4), and applied t-SNE independently for each model. The results are shown in Figure 10.
It is evident from the t-SNE plots that the clustering of the dataset is now significantly more complex than that observed in Figure 6. However, all three encoders appear to have successfully learnt how to distinguish between the two compression types, and are also mostly successful when clustering the four types of noise (MoCo is slightly weaker for grey noise). In the supplementary information (Figure S1), we also show that the encoders are capable of separating different intensities of both compression and noise, albeit with less separation of the two noise types.
For blurring, the separation between isotropic and anisotropic kernels is much less logical. It appears that each encoder was attempting to form sub-clusters for each type of kernel in some cases (in particular for SupMoCo), but the separation is significantly less clear-cut than that obtained in Figure 6. Further analysis would be required to decipher whether clustering is weak simply due to the difficulty of the exercise, or whether clustering is being mostly influenced by the other degradations considered in the pipeline.
As observed with the simple pipeline, it is again apparent that the different methods of semi-supervision seem to be converging to similar results. This is also in spite of the fact that WeakCon was supplied with only 6 degradation elements while SupMoCo was supplied with the full degradation metadata through its class system. Further investigation into their learning process could reveal further insight into the effects of each algorithm.
4.5.2. Iterative Parameter Regression
The RCAN-DAN model was trained on the complex pipeline dataset with identical hyperparameters to those of the simple pipeline. For degradation prediction, we set the DAN model to predict a vector with the following elements (15 in total; a sketch of the target vector construction is given after the list):
Individual elements for the following blur parameters: vertical and horizontal σ, rotation, the individual shape parameters β for generalised Gaussian and plateau kernels, and the sinc cutoff frequency. Whenever one of these elements was unused (e.g., the cutoff frequency for Gaussian kernels), it was set to 0. All elements were normalised to [0, 1] according to their respective ranges (Section 4.1).
Four boolean (0 or 1) elements categorising whether the kernel shape was:
- Isotropic or anisotropic
- Generalised
- Plateau-type
- Sinc
Individual elements for the Gaussian sigma and Poisson scale (both normalised to [0, 1]).
A boolean indicating whether the noise was colour or grey type.
Individual elements for the JM H.264 QPI and JPEG quality factor (both normalised to [0, 1]).
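A sketch of how such a 15-element ground-truth vector can be assembled from a sampled degradation configuration is given below; the dictionary keys are hypothetical, and all continuous entries are assumed to be pre-normalised to [0, 1].

```python
import numpy as np


def build_target_vector(cfg: dict) -> np.ndarray:
    """Assemble the 15-element ground-truth degradation vector; unused
    entries (e.g., the sinc cutoff for a Gaussian kernel) default to 0."""
    v = np.zeros(15, dtype=np.float32)
    # Blur parameters (6 elements).
    v[0] = cfg.get("sigma_v", 0.0)
    v[1] = cfg.get("sigma_h", 0.0)
    v[2] = cfg.get("rotation", 0.0)
    v[3] = cfg.get("beta_gen_gaussian", 0.0)
    v[4] = cfg.get("beta_plateau", 0.0)
    v[5] = cfg.get("sinc_cutoff", 0.0)
    # Kernel shape booleans (4 elements).
    v[6] = float(cfg.get("anisotropic", False))
    v[7] = float(cfg.get("generalised", False))
    v[8] = float(cfg.get("plateau", False))
    v[9] = float(cfg.get("sinc", False))
    # Noise parameters (3 elements).
    v[10] = cfg.get("gaussian_sigma", 0.0)
    v[11] = cfg.get("poisson_scale", 0.0)
    v[12] = float(cfg.get("grey_noise", False))
    # Compression parameters (2 elements).
    v[13] = cfg.get("qpi", 0.0)
    v[14] = cfg.get("jpeg_quality", 0.0)
    return v


print(build_target_vector({"sigma_v": 0.5, "sigma_h": 0.5, "grey_noise": True}))
```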
We tested the prediction accuracy by evaluating the model on a number of our testing scenarios, and then quantifying the degradation prediction error. The results are shown in Table 5. As observed with the contrastive models, blur kernel parameter prediction accuracy is extremely low, even when no other degradations are present. On the other hand, both noise and compression prediction are significantly better, with sub-0.1 error in all cases, even when all degradations are present. We hypothesise that, since blur kernels are introduced as the first degradation in the pipeline, most of the blurring information could be masked once noise addition and compression have been applied.
To the best of our knowledge, we are the first to present fully explicit blind degradation prediction on this complex pipeline. We hope that the prediction results achieved in this analysis can act as a baseline from which further advances and improvements can be made.
4.6. Blind SR on Complex Pipeline
For blind SR on the complex pipeline, we focus on just the RCAN and RCAN-upgraded models to simplify the analysis. We use a single MA block to insert metadata into the SR core in all cases apart from one, where we distribute MA throughout RCAN. We also trained a number of non-blind models (fed with different quantities of the correct metadata) as comparison points. PSNR results comparing the baseline RCAN to the blind models are provided in Table 6 (Table S6 in the supplementary information provides the SSIM results).
We make the following observations on these results:
Compression- and noise-only scenarios: In these scenarios, the RCAN-DAN model shows a clear improvement over all baseline and contrastive encoder models (apart from some cases on Manga109). The improvement is most significant in the compression scenarios.
Blur-only scenarios: Since the blurring scenarios are very similar or identical to the simple pipeline, the models from Table 3 (RCAN is also shown in Table 6) are significantly stronger. The DAN model overtakes the baseline in some cases, but is very inconsistent.
Multiple combinations: In the multiple degradation scenarios, the DAN model consistently overtakes the baselines, but PSNR/SSIM increases are minimal.
For all scenarios, there are a number of other surprising results. The contrastive methods appear to provide no benefit to SR performance in almost every case. Furthermore, the non-blind models are often overtaken by the DAN model in certain scenarios, and the amount of metadata available to the non-blind models does not appear to correlate with the final SR performance. It is clear that the metadata available for these degradations has a much smaller impact on SR performance than on the simple pipeline. Since the contrastive encoders were shown to be slightly weaker than DAN in the simple pipeline case (Figure 7), it is clear that their limited prediction accuracy is also limiting potential gains in SR performance on this pipeline. This dataset is significantly more difficult than the simple case, not just due to the increased number of degradations, but also because the models appear less receptive to the insertion of metadata. We again hope that these results can act as a baseline for further exploration into complex blind SR.
4.7. Blind SR on Real LR Images
As a final test to compare models from both pipelines, we ran a select number of models on real-world images from RealSRSet [19]. These results are shown in Figure 11, with an additional image provided in the supplementary information (Figure S2). This qualitative inspection clearly shows that models trained on the complex pipeline are significantly better at dealing with real-world degradations than simple pipeline models. Figure 11 shows that the complex pipeline models can remove compression artefacts, sharpen images and smooth out noise. In particular, the dog image shows that RCAN-DAN can deal with noise more effectively than the baseline RCAN. The simple pipeline model results are all very similar to each other, as none of them are capable of dealing with degradations other than isotropic blurring.
4.8. Results Summary
Given the large quantity of analyses conducted, we provide a brief summary of the most significant results obtained in each section here:
In Section 4.2, we show that all of the metadata insertion mechanisms tested provide roughly the same SR performance boost when feeding a large network such as RCAN with non-blind blurring metadata. Furthermore, adding repeated blocks throughout the network provides little to no benefit. Given this result, we propose MA as our metadata insertion block of choice, as it provides identical SR performance to the other options considered, with very low complexity. Other metadata blocks could prove optimal in other scenarios (such as with other degradations or other networks), which would require further systematic investigation to determine.
Section 4.3 provides a comparison of the prediction performance of the different algorithms considered on the simple blur pipeline. The contrastive algorithms clearly cluster images by the applied blur kernel width, with the semi-supervised algorithms providing the most well-defined separation between different σ values. The regression and iterative mechanisms are capable of explicitly predicting the blur kernel width with high accuracy, except at the lower extreme. Our prediction mechanisms combined with RCAN match the performance of the pretrained DAN models with significantly less training time.
Section 4.4 compares the testing results of blind models built with our framework with baseline models from the literature. Each prediction mechanism considered elevates RCAN’s SR performance above its baseline value, for both PSNR and SSIM. In particular, the iterative mechanism provided the largest performance boost. For more complex models such as HAN and Real-ESRGAN, contrastive methods provide less benefit, but the iterative mechanism still shows clear improvements. Our models significantly overtake the SOTA blind DAN network when trained for the same length of time. In addition, our models approach or surpass the performance of the pretrained DANv1 and DANv2 checkpoints provided by their authors, which were trained for a significantly longer period of time.
In Section 4.5 and Section 4.6, we modify our prediction mechanisms to deal with a more complex pipeline of blurring, noise and compression, and attach these to the RCAN network. We show that the contrastive predictors can reliably cluster compression and noise, but blur kernel clustering is significantly weaker. Similarly, the iterative predictors are highly accurate when predicting compression/noise parameters, but are much less reliable for blur parameters.
When testing their SR performance, the contrastive encoders seem to provide little to no benefit to RCAN’s performance. On the other hand, the DAN models reliably improve the baseline performance across various scenarios, albeit with limited improvements when all degradations are present at once. We anticipate that performance can be significantly improved with further advances to the prediction mechanisms and consider our results as a baseline for further exploration.
Section 4.7 showcases the results of our models when applied to real-world LR images. Our complex pipeline models produce significantly better results than the pretrained DAN models and are capable of reversing noise, compression and blurring in various scenarios.
5. Conclusions
In this work, a framework for combining degradation prediction systems with any SR network was proposed. By using a single metadata insertion block to influence the feature maps of a convolutional layer, a degradation vector from a prediction model can, in many cases, be used to improve the performance of the SR network. This premise was tested by implementing several contrastive and iterative degradation prediction mechanisms and coupling them with high-performing SR architectures. When tested on a dataset of images that were degraded by Gaussian blurring and downsampling, we show that our blind mechanisms achieve at least the same (or better) blur σ prediction accuracy as the original methods, but with significantly less training time. Moreover, both blind degradation prediction performance (in combined training cases, such as with DAN) and SR performance are substantially improved through the use of larger and stronger networks such as RCAN [8] or HAN [10]. Our results show that our hybrid models surpass the performance of the baseline non-blind and blind models under the same conditions. Other SR architecture categories, such as the SOTA perceptual-loss-based Real-ESRGAN [16] and the transformer-based ELAN architecture [13], also work within our framework, but the performance of these methods is more sensitive to the accuracy of the degradation prediction and the dataset used for training. We show that this premise also holds true for blind SR on a more complex pipeline involving various blurring, noise injection, and compression operations.
Our framework should enable blind SR research to be significantly expedited, since researchers can now focus their efforts on their degradation prediction mechanisms, rather than on deriving a custom SR architecture for each new method. There are various future avenues that could be explored to further assess the applications of our framework. Apart from investigating new combinations of blind prediction, metadata insertion and SR architectures, our framework could also be applied to new types of metadata. For example, blind prediction systems could be replaced with image classification systems, which would provide the SR architecture with details on the image content (e.g., facial features for face SR [36]). Furthermore, the framework can be extended to video SR [91], where additional sources of metadata are available, such as the number of frames to be used in the super-resolution of a given frame, as well as other details on the compression scheme, such as P- and B-frames (in addition to the I-frames considered in this work).